INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE 



Maximum Cliques in Protein Structure Comparison 






N. Malod-Dognin — R. Andonov — N. Yanev 






No. 7053

October 2009







Research centre Rennes - Bretagne Atlantique






Theme: Bio
Project-team: Symbiose



Research report no. 7053, October 2009



Abstract: Computing the similarity between two protein structures is a crucial
task in molecular biology and has been extensively investigated. Many protein
structure comparison methods can be modeled as maximum clique problems in
specific k-partite graphs, referred to here as alignment graphs.

In this paper, we propose a new protein structure comparison method based on
internal distances (DAST), which is posed as a maximum clique problem in an
alignment graph. We also design an algorithm (ACF) for solving such maximum
clique problems. ACF is first applied in the context of VAST, a software tool
widely used at the National Center for Biotechnology Information, and then in
the context of DAST. The results obtained on real protein alignment instances
show that our algorithm is more than 37000 times faster than the original VAST
clique solver, which is based on the Bron and Kerbosch algorithm. We furthermore
compare ACF with one of the fastest clique finders, recently conceived by
Östergård. On a popular benchmark (the Skolnick set) we observe that ACF is
about 20 times faster on average than Östergård's algorithm.

Key-words: protein structure comparison, maximum clique problem, k-partite
graphs, combinatorial optimization.
1 Introduction

A fruitful assumption in molecular biology is that proteins with similar
three-dimensional (3D) structures are likely to share a common function and, in
most cases, derive from a common ancestor. Understanding and computing the
physical similarity of protein structures is one of the keys to developing
protein-based medical treatments, and it has thus been extensively investigated
[8, 14]. Evaluating the similarity of two protein structures can be done by
finding an optimal (according to some criteria) order-preserving matching (also
called alignment) between their components. We show that finding such alignments
is equivalent to solving maximum clique problems in specific k-partite graphs,
referred to here as alignment graphs. These graphs can be very large (more than
25000 vertices and $3 \times 10^7$ edges) when comparing real protein
structures. We are not aware of any previous specialized algorithm for solving
the maximum clique problem in k-partite graphs. Even very recent general clique
finders [13, 16] are oriented to notably smaller instances and are not able to
solve problems of such size (the available code of [16] is limited to graphs
with up to 1000 vertices).

For solving the maximum clique problem in this context we conceive an
algorithm, denoted by ACF (for Alignment Clique Finder), which profits from
the particular structure of the alignment graphs. We furthermore compare
ACF to an efficient general clique solver [13], and the obtained results clearly
demonstrate the usefulness of our dedicated algorithm.

1.1 The maximum clique problem 

We usually denote an undirected graph by $G = (V, E)$, where $V$ is the set of
vertices and $E$ is the set of edges. Two vertices $i$ and $j$ are said to be
adjacent if they are connected by an edge of $E$. A clique of a graph is a
subset of its vertex set such that any two vertices in it are adjacent.

Definition 1 The maximum clique problem (also called maximum cardinality
clique problem) is to find a largest clique, in terms of vertices, of an
arbitrary undirected graph $G$; such a clique will be denoted by $MCC(G)$.

The maximum clique problem is one of the first problems shown to be
NP-complete [9], and it has been studied extensively in the literature.
Interested readers can refer to [3] for a detailed state of the art on the
maximum clique problem.

1.2 Alignment graphs 

In this paper, we focus on grid-like graphs, which we define as follows.

Definition 2 An $m \times n$ alignment graph $G = (V, E)$ is a graph in which
the vertex set $V$ is depicted by an ($m$-rows) $\times$ ($n$-columns) array
$T$, where each cell $T[i][k]$ contains at most one vertex $i.k$ from $V$ (note
that for both arrays and vertices, the first index stands for the row number,
and the second for the column number). Two vertices $i.k$ and $j.l$ can be
connected by an edge $(i.k, j.l) \in E$ only if $i < j$ and $k < l$. An example
of such an alignment graph is given in Fig. 2.

It is easily seen that the $m$ rows form an $m$-partition of the alignment graph
$G$, and that the $n$ columns also form an $n$-partition. In the rest of this
paper we will use the following notations. A successor of a vertex $i.k \in V$
is an element of the set
$\Gamma^+(i.k) = \{ j.l \in V \text{ s.t. } (i.k, j.l) \in E, i < j \text{ and } k < l \}$.
$V^{i.k}$ is the subset of $V$ restricted to vertices in rows $j$,
$i \leq j \leq m$, and in columns $l$, $k \leq l \leq n$. Note that
$\Gamma^+(i.k) \subset V^{i+1.k+1}$. $G^{i.k}$ is the subgraph of $G$ induced by
the vertices in $V^{i.k}$. The cardinality of a vertex set $U$ is $|U|$.
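
The notation above can be made concrete with a small sketch. The Python fragment below is illustrative only: the representation of a vertex $i.k$ as a tuple `(i, k)`, the storage of edges as a set of two-element frozensets, the function names and the toy graph are all assumptions made here, not part of the original report.

```python
from itertools import combinations

def may_be_adjacent(u, v):
    """Definition 2: an edge (i.k, j.l) is allowed only if i < j and k < l."""
    (i, k), (j, l) = sorted((u, v))
    return i < j and k < l

def successors(vertex, edges):
    """Gamma+(i.k): vertices j.l joined to i.k by an edge, with i < j and k < l."""
    i, k = vertex
    return {w for e in edges if vertex in e for w in e
            if w != vertex and w[0] > i and w[1] > k}

def is_clique(vertices, edges):
    """True if every pair of the given vertices is joined by an edge."""
    return all(frozenset(p) in edges for p in combinations(vertices, 2))

# Toy 3 x 3 alignment graph (hypothetical data, 1-based indices).
E = {frozenset(p) for p in [((1, 1), (2, 2)), ((1, 1), (3, 3)),
                            ((2, 2), (3, 3)), ((1, 2), (2, 3))]}
```

On this toy graph, `successors((1, 1), E)` contains the vertices 2.2 and 3.3, and the three diagonal vertices form a clique.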

1.3 Relations with protein structure similarity 

From a general point of view, two proteins $P_1$ and $P_2$ can be represented
by their ordered sets of components $N_1$ and $N_2$, and estimating their
similarity can be done by finding the longest alignment between the elements of
$N_1$ and $N_2$. In our approach, such matchings are represented in an
$|N_1| \times |N_2|$ alignment graph $G = (V, E)$, where each row corresponds to
an element of $N_1$ and each column corresponds to an element of $N_2$. A vertex
$i.k$ is in $V$ (i.e. matching $i \leftrightarrow k$ is possible) only if
elements $i \in N_1$ and $k \in N_2$ are compatible. An edge $(i.k, j.l)$ is in
$E$ if and only if (i) $i < j$ and $k < l$, for order preservation, and (ii)
matching $i \leftrightarrow k$ is compatible with matching
$j \leftrightarrow l$. A feasible matching of $P_1$ and $P_2$ is then a clique
in $G$, and the longest alignment corresponds to a maximum clique in $G$. There
is a multitude of alignment methods, and they differ mainly by the nature of the
elements of $N_1$ and $N_2$ and by the compatibility definitions between
elements and between pairs of matched elements. At least two protein structure
similarity problems from the literature can be converted into clique problems in
alignment graphs: the secondary structure alignment in VAST, and the Contact Map
Overlap Maximization problem (CMO).

VAST, or Vector Alignment Search Tool, is a software tool for aligning protein
3D structures that is widely used at the National Center for Biotechnology
Information¹. In VAST, $N_1$ and $N_2$ contain 3D vectors representing the
secondary structure elements (SSE) of $P_1$ and $P_2$. Matching
$i \leftrightarrow k$ is possible if vectors $i$ and $k$ have similar norms and
correspond either both to α-helices or both to β-strands. Finally, matching
$i \leftrightarrow k$ is compatible with matching $j \leftrightarrow l$ only if
the couple of vectors $(i, j)$ from $P_1$ can be well superimposed in 3D space
with the couple of vectors $(k, l)$ from $P_2$.

CMO is one of the most reliable and robust measures of protein structure
similarity. Comparisons are done by aligning the residues (amino acids) of two
proteins in a way that maximizes the number of common contacts (a common contact
occurs when two residues that are close in 3D space are matched with two
residues that are also close in 3D space). We have already dealt with CMO in
[1], but not by using cliques. Note that a maximum clique formulation in
alignment graphs was proposed by Strickland et al. in [15], but this formulation
differs from ours.

1.4 DAST: an improvement of CMO based on internal 
distances 

One of the main drawbacks of CMO is that, in order to maximize the number
of common contacts, it also introduces some "errors", such as aligning two
residues that are close in 3D space with two residues that are remote, as
illustrated in Fig. 1. These errors could potentially yield alignments with
large root mean square deviations (RMSD), which is not desirable for structure
comparison.

¹ http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
Figure 1: An optimal CMO matching.

Two proteins ($P_1$ and $P_2$) are represented by their contact map graphs,
where the vertices correspond to the residues and the edges connect residues in
contact (i.e. close). The matching "$1 \leftrightarrow 1'$,
$2 \leftrightarrow 3'$, $4 \leftrightarrow 4'$", represented by the arrows,
yields two common contacts, which is the maximum for the considered case.
However, it also matches residues 1 and 4 from $P_1$, which are in contact, with
residues $1'$ and $4'$ in $P_2$, which are remote.

To avoid such problems we propose DAST (Distance-based Alignment Search Tool),
an alignment method based on internal distances which is modeled in an alignment
graph. In DAST, the two proteins $P_1$ and $P_2$ are represented by their
ordered sets of residues $N_1$ and $N_2$. Two residues $i \in N_1$ and
$k \in N_2$ are compatible if they come from the same kind of secondary
structure element (i.e. $i$ and $k$ both come from an α-helix, or both from a
β-strand) or if both come from a loop. Let us denote by $d_{ij}$ (resp.
$d_{kl}$) the Euclidean distance between the α-carbons of residues $i$ and $j$
(resp. $k$ and $l$). Matching $i \leftrightarrow k$ is compatible with matching
$j \leftrightarrow l$ only if $|d_{ij} - d_{kl}| \leq \tau$, where $\tau$ is a
distance threshold. The longest alignment in terms of residues, in which each
couple of residues from $P_1$ is matched with a couple of residues from $P_2$
having similar distance relations, corresponds to a maximum clique in $G$. Since

$$RMSD = \sqrt{\frac{1}{N_m} \times \sum \left( |d_{ij} - d_{kl}|^2 \right)},$$

where $N_m$ is the number of matching pairs "$i \leftrightarrow k$,
$j \leftrightarrow l$", the alignments given by DAST have an RMSD of internal
distances $\leq \tau$.
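
The construction of a DAST alignment graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0-based residue indices, the one-letter secondary-structure labels (for instance `'H'`, `'E'`, `'L'`) and the default value of `tau` are choices made here for the example.

```python
import math
from itertools import combinations

def dast_alignment_graph(ca1, sse1, ca2, sse2, tau=3.0):
    """Build vertices and edges of a DAST-style alignment graph.

    ca1, ca2   : lists of (x, y, z) alpha-carbon coordinates.
    sse1, sse2 : secondary-structure label of each residue.
    tau        : distance threshold, in the unit of the coordinates.
    """
    # Vertex i.k exists when residues i and k have the same SSE type.
    vertices = [(i, k) for i in range(len(ca1)) for k in range(len(ca2))
                if sse1[i] == sse2[k]]
    edges = set()
    # vertices is sorted row-first, so every pair satisfies i <= j.
    for (i, k), (j, l) in combinations(vertices, 2):
        if i < j and k < l:                          # order preservation
            d_ij = math.dist(ca1[i], ca1[j])
            d_kl = math.dist(ca2[k], ca2[l])
            if abs(d_ij - d_kl) <= tau:              # similar internal distances
                edges.add(((i, k), (j, l)))
    return vertices, edges

def internal_rmsd(clique, ca1, ca2):
    """RMSD of internal distances over all pairs of matched residues."""
    sq = [(math.dist(ca1[i], ca1[j]) - math.dist(ca2[k], ca2[l])) ** 2
          for (i, k), (j, l) in combinations(sorted(clique), 2)]
    return math.sqrt(sum(sq) / len(sq))
```

Because every pair of vertices in a clique is joined by an edge, each squared term in `internal_rmsd` is at most `tau ** 2`, so the returned value cannot exceed `tau`.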
2 Branch and Bound approach

We have been inspired by [13] to propose our own algorithm, which is more
suitable for solving the maximum clique problem in the previously defined
$m \times n$ alignment graph $G = (V, E)$. Let $Best$ be the biggest clique
found so far (initially set to $\emptyset$), and let $\overline{|MCC(G)|}$ be an
over-estimation of $|MCC(G)|$. By definition,
$V^{i+1.k+1} \subset V^{i.k+1} \subset V^{i.k}$, and similarly
$V^{i+1.k+1} \subset V^{i+1.k} \subset V^{i.k}$. From these inclusions and from
Definition 2 it is easily seen that for any $G^{i.k}$, $MCC(G^{i.k})$ is the
biggest clique among $MCC(G^{i+1.k})$, $MCC(G^{i.k+1})$ and
$MCC(G^{i+1.k+1}) \cup \{i.k\}$, but for the latter only if vertex $i.k$ is
adjacent to all vertices in $MCC(G^{i+1.k+1})$. Let $C$ be an
$(m + 1) \times (n + 1)$ array where $C[i][k] = \overline{|MCC(G^{i.k})|}$
(values in row $m + 1$ or column $n + 1$ are equal to 0). For reasoning
purposes, let us assume that the upper bounds in $C$ are exact. If a vertex
$i.k$ is adjacent to all vertices in $MCC(G^{i+1.k+1})$, then
$C[i][k] = 1 + C[i+1][k+1]$, else $C[i][k] = \max(C[i][k+1], C[i+1][k])$. We can
deduce that a vertex $i.k$ cannot be in a clique in $G^{i.k}$ which is bigger
than $Best$ if $C[i+1][k+1] < |Best|$, and this reasoning still holds if the
values in $C$ are upper estimations. Another important inclusion is
$\Gamma^+(i.k) \subset V^{i+1.k+1}$. Even if $C[i+1][k+1] \geq |Best|$, if
$\overline{|MCC(\Gamma^+(i.k))|} < |Best|$ then $i.k$ cannot be in a clique in
$G^{i.k}$ bigger than $Best$.




Our main clique cardinality estimator is constructed and used according to
these properties. A function, Find_clique(G), visits the cells of $T$ along
north-west to south-east diagonals, from diagonal "$i + k = m + n$" to diagonal
"$i + k = 2$", as illustrated in Fig. 2b. For each cell $T[i][k]$ containing a
vertex $i.k \in V$, it may call Extend_clique($\{i.k\}$, $\Gamma^+(i.k)$), a
function which tries to extend the clique $\{i.k\}$ with vertices in
$\Gamma^+(i.k)$ in order to obtain a clique bigger than $Best$ (which cannot be
bigger than $|Best| + 1$). If such a clique is found, $Best$ is updated.
However, Find_clique() calls Extend_clique() only if two conditions are
satisfied: (i) $C[i+1][k+1] = |Best|$ and (ii)
$\overline{|MCC(\Gamma^+(i.k))|} \geq |Best|$. After the call to
Extend_clique(), $C[i][k]$ is set to $|Best|$. For all other cells $T[i][k]$,
$C[i][k]$ is set to $\max(C[i][k+1], C[i+1][k])$ if $i.k \notin V$, or to
$1 + C[i+1][k+1]$ if $i.k \in V$. Note that the order used for visiting the
cells in $T$ guarantees that when computing the value of $C[i][k]$, the values
of $C[i+1][k]$, $C[i][k+1]$ and $C[i+1][k+1]$ are already computed.

Array $C$ can also be used in function Extend_clique() to speed up the maximum
clique search. This function is a branch and bound (B&B) search using the
following branching rules. Each node of the B&B tree is characterized by a
couple $(Cli, Cand)$, where $Cli$ is the clique under construction and $Cand$ is
the set of candidate vertices to be added to $Cli$. Each call to
Extend_clique($\{i.k\}$, $\Gamma^+(i.k)$) creates a new B&B tree whose root node
is $(\{i.k\}, \Gamma^+(i.k))$. The successors of a B&B node $(Cli, Cand)$ are
the nodes $(Cli \cup \{i'.k'\}, Cand \cap \Gamma^+(i'.k'))$, for all vertices
$i'.k' \in Cand$. Branching follows lexicographic increasing order (row first).
According to the branching rules, for any given B&B node $(Cli, Cand)$ the
following cutting rules hold: (i) if $|Cli| + |Cand| \leq |Best|$, then the
current branch cannot lead to a clique bigger than $|Best|$ and can be fathomed,
(ii) if $\overline{|MCC(Cand)|} \leq |Best| - |Cli|$, then the current branch
cannot lead to a clique bigger than $|Best|$, and (iii) if
$\overline{|MCC(Cand \cap \Gamma^+(i.k))|} \leq |Best| - |Cli| - 1$, then
branching on $i.k$ cannot lead to a clique bigger than $|Best|$. For any set
$Cand$ and any vertex $i.k$, $Cand \cap \Gamma^+(i.k) \subset \Gamma^+(i.k)$,
and $\Gamma^+(i.k) \subset V^{i+1.k+1}$. From these inclusions we can deduce two
ways of over-estimating $|MCC(Cand \cap \Gamma^+(i.k))|$: first, by using
$C[i+1][k+1]$, which over-estimates $|MCC(G^{i+1.k+1})|$, and second, by
over-estimating $|MCC(\Gamma^+(i.k))|$. All values
$\overline{|MCC(\Gamma^+(i.k))|}$ are computed once and for all in Find_clique()
and thus only $\overline{|MCC(Cand)|}$ needs to be computed in each B&B node.
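
The two functions described in this section can be sketched in Python as below. This is a simplified reading of the text, not the authors' C implementation, and several details are assumptions made here: vertices are 1-based tuples `(i, k)`, `succ` maps every vertex to its successor set $\Gamma^+$ (an empty set when there is none), the over-estimator is the longest increasing subset of Section 3.2, test (i) is written with `>=`, and each stored bound is capped at the size of the best clique found so far.

```python
from bisect import bisect_left

def lis_bound(vertices):
    """Longest increasing subset of vertices, an over-estimate of the
    size of a maximum clique among `vertices`."""
    tails = []
    # Columns ascending, rows descending inside a column.
    for i, _k in sorted(vertices, key=lambda v: (v[1], -v[0])):
        pos = bisect_left(tails, i)
        if pos == len(tails):
            tails.append(i)
        else:
            tails[pos] = i
    return len(tails)

def find_clique(m, n, succ):
    """Return a maximum clique of an m x n alignment graph."""
    best = []
    C = [[0] * (n + 2) for _ in range(m + 2)]   # border row/column stay 0
    gamma = {v: lis_bound(s) for v, s in succ.items()}

    def extend(cli, cand):
        nonlocal best
        if len(cli) > len(best):
            best = cli
        if len(cli) + len(cand) <= len(best):          # cutting rule (i)
            return
        if len(cli) + lis_bound(cand) <= len(best):    # cutting rule (ii)
            return
        for v in sorted(cand):                         # row-first order
            bound = min(C[v[0] + 1][v[1] + 1], gamma[v])
            if bound <= len(best) - len(cli) - 1:      # cutting rule (iii)
                continue
            extend(cli + [v], cand & succ[v])

    for s in range(m + n, 1, -1):                      # diagonals i + k = s
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            k = s - i
            if (i, k) not in succ:                     # empty cell
                C[i][k] = max(C[i][k + 1], C[i + 1][k])
                continue
            if C[i + 1][k + 1] >= len(best) and gamma[(i, k)] >= len(best):
                extend([(i, k)], succ[(i, k)])
            C[i][k] = min(len(best), 1 + C[i + 1][k + 1])
    return best
```

The diagonal order guarantees that `C[i + 1][k + 1]` is already filled both when a cell is visited and when rule (iii) is evaluated inside `extend`, since every successor lies strictly below and to the right of the current vertex.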
3 Maximum clique cardinality estimators

Even if the described functions depend on array $C$, they also use another
upper estimator of the cardinality of a maximum clique in an alignment graph. By
using the properties of alignment graphs, we developed the following estimators.

3.1 Minimum number of rows and columns 

Definition 2 implies that there is no edge between vertices from the same row or
the same column. This means that in an $m \times n$ alignment graph,
$|MCC(G)| \leq \min(m, n)$. If the numbers of rows and columns are not computed
at the creation of the alignment graph, they can be computed in $O(|V|)$ time.
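
As a small illustration, the bound can be computed in one pass over the vertex set. The sketch below counts only the rows and columns that actually hold a vertex, which is a slight variation assumed here (it can only tighten $\min(m, n)$); the function name and the tuple representation of vertices are also assumptions.

```python
def rows_cols_bound(vertices):
    """Upper bound on the maximum clique size: a clique uses each row and
    each column at most once, so it has at most min(#rows, #columns)
    vertices, counting only non-empty rows and columns."""
    rows = {i for i, _k in vertices}
    cols = {k for _i, k in vertices}
    return min(len(rows), len(cols))
```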

3.2 Longest increasing subset of vertices 

Definition 3 An increasing subset of vertices in an alignment graph
$G = (V, E)$ is an ordered subset $\{i_1.k_1, i_2.k_2, \ldots, i_t.k_t\}$ of $V$
such that $\forall j \in [1, t-1]$, $i_j < i_{j+1}$ and $k_j < k_{j+1}$.
$LIS(G)$ is the longest, in terms of vertices, increasing subset of vertices of
$G$.

Since any two vertices in a clique are adjacent, Definition 2 implies that a
clique in $G$ is an increasing subset of vertices. However, an increasing subset
of vertices is not necessarily a clique (since vertices are not necessarily
adjacent), and thus $|MCC(G)| \leq |LIS(G)|$. In an $m \times n$ alignment graph
$G = (V, E)$, $LIS(G)$ can be computed in $O(n \times m)$ time by dynamic
programming. However, by using the longest increasing subsequence it is possible
to solve $LIS(G)$ in $O(|V| \times \ln(|V|))$ time, which is better suited to
sparse graphs such as those in our protein structure comparison experiments.

Definition 4 The longest increasing subsequence of an arbitrary finite
sequence of integers $S$ = "$i_1, i_2, \ldots, i_n$" is the longest subsequence
$S'$ = "$i'_1, i'_2, \ldots, i'_t$" of $S$ respecting the original order of $S$,
and such that for all $j \in [1, t-1]$, $i'_j < i'_{j+1}$. For example, the
longest increasing subsequence of "1, 5, 2, 3" is "1, 2, 3".

For any given alignment graph $G = (V, E)$, we can easily reorder the vertex
set $V$, first by increasing order of columns, and second by decreasing order of
rows. Let us denote by $V'$ this reordered vertex set. Then we can create an
integer sequence $S$ corresponding to the row indexes of the vertices in $V'$.
For example, by using the alignment graph presented in Fig. 2, the reordered
vertex set $V'$ is
{4.1, 2.1, 1.1, 3.2, 4.3, 3.3, 2.3, 1.3, 4.4, 3.4, 1.4}, and the corresponding
sequence of row indexes $S$ is "4, 2, 1, 3, 4, 3, 2, 1, 4, 3, 1". An increasing
subsequence of $S$ will pick at most one number from a column, and thus an
increasing subsequence is longest if and only if it covers a maximal number of
increasing rows. This proves that solving the longest increasing subsequence in
$S$ is equivalent to solving the longest increasing subset of vertices in $G$.
Note that the longest increasing subsequence problem is solvable in time
$O(l \times \ln(l))$ [5], where $l$ denotes the length of the input sequence. In
our case, this corresponds to $O(|V| \times \ln(|V|))$.
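
The reduction described above can be sketched with the classical binary-search method for the longest increasing subsequence. The function names and the tuple representation of vertices are assumptions made for this example; the vertex list is the one quoted in the text.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence of seq."""
    tails = []  # tails[p]: smallest last value of an increasing run of length p + 1
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def lis_of_graph(vertices):
    """|LIS(G)|: sort by increasing column, then decreasing row, and take
    the longest increasing subsequence of the row indexes."""
    ordered = sorted(vertices, key=lambda v: (v[1], -v[0]))
    return lis_length([i for i, _k in ordered])

# Vertex set of the example in the text, written as (row, column) pairs.
V = [(4, 1), (2, 1), (1, 1), (3, 2), (4, 3), (3, 3), (2, 3), (1, 3),
     (4, 4), (3, 4), (1, 4)]
print(lis_of_graph(V))  # 3
```

Sorting rows in decreasing order inside a column is what prevents two vertices of the same column from both entering an increasing subsequence.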
3.3 Longest increasing path

Definition 5 An increasing path in an alignment graph $G = (V, E)$ is an
increasing subset of vertices $\{i_1.k_1, i_2.k_2, \ldots, i_t.k_t\}$ such that
$\forall j \in [1, t-1]$, $(i_j.k_j, i_{j+1}.k_{j+1}) \in E$. The longest
increasing path in $G$ is denoted by $LIP(G)$.

As the increasing path takes into account the edges between consecutive
vertices, $|LIP(G)|$ should be a better estimate of $|MCC(G)|$. $|LIP(G)|$ can
be computed in $O(|V|^2)$ by the following recurrence. Let $DP[i][k]$ be the
length of the longest increasing path in $G^{i.k}$ containing vertex $i.k$. Then
$DP[i][k] = 1 + \max_{i'.k' \in \Gamma^+(i.k)} (DP[i'][k'])$. Evaluating the
recurrence over all sets $\Gamma^+(i.k)$ takes $O(|E|)$ time in total, and
finding the maximum over all $DP[i][k]$ takes $O(|V|)$ time. This results in an
$O(|V| + |E|)$ time complexity for computing $|LIP(G)|$.
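
The recurrence can be sketched as follows, assuming (as an illustration only) that the graph is given as a dictionary `succ` mapping every vertex `(i, k)` to its successor set $\Gamma^+(i.k)$. The sort used here to order the vertices adds a logarithmic factor that a direct scan of the array $T$ from the last row upwards would avoid.

```python
def longest_increasing_path_length(succ):
    """|LIP(G)| by dynamic programming over the successor sets."""
    dp = {}
    # Successors lie in strictly larger rows, so visiting the vertices by
    # decreasing row guarantees dp[w] is known for every successor w.
    for v in sorted(succ, reverse=True):
        dp[v] = 1 + max((dp[w] for w in succ[v]), default=0)
    return max(dp.values(), default=0)
```

On a three-vertex chain 1.1, 2.2, 3.3 in which 1.1 and 3.3 are not adjacent, the function returns 3 although the largest clique has only two vertices, which illustrates that $|LIP(G)|$ is an upper bound and not the clique size itself.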

Amongst all of the previously defined estimators, the longest increasing
subset of vertices (solved using the longest increasing subsequence) exhibits
the best performance and is the one we used for obtaining the results presented
in the next section.
4 Results

All results presented in this section come from real protein structure
comparison instances. Our algorithm, denoted by ACF (for Alignment Clique
Finder), has been implemented in C and was tested in two different contexts:
secondary structure alignments in VAST and residue alignments in DAST. ACF is
compared to Östergård's algorithm [13] (denoted by Östergård) and to the
original VAST clique solver, which is based on Bron and Kerbosch's algorithm [4]
(denoted by BK). Note that BK is not a maximum clique finder but returns all
maximal cliques in a graph.

4.1 Secondary structures alignments 

This section illustrates the behavior of ACF in the context of secondary
structure element (SSE) alignments. For this purpose we integrated ACF and
Östergård (whose code is freely available) into VAST. We then compared them with
BK by selecting a few large protein chains having between 80 and 90 SSEs (for
smaller protein chains the running times of both Östergård and ACF are less than
0.01 sec). Computations were done on a 2.4 GHz AMD computer, and the
corresponding running times are presented in Table 1. We observe that Östergård
is 4053 times faster than BK, and that ACF is about 9.3 times faster than
Östergård. Although we have chosen large protein chains, the SSE alignment
graphs are relatively small (up to 5423 vertices and 551792 edges). On such
graphs the difference in performance between Östergård and ACF is not very
visible; it is better illustrated on the larger alignment graphs of the next
section.

Table 1: Running time comparison on secondary structure alignment instances

Instances            BK (sec.)    Östergård (sec.)    ACF (sec.)
1k32B    1n6el        1591.89          1.42              0.09
1k32B    1n6ffl       1546.78          0.01              0.01
1k32B    1n6fF        1584.25          0.14              0.02
1n6dD    1k32B        1373.35          0.06              0.01
1n6dD    1n6el        1390.27          0.11              0.03
1n6dD    1n6ffl       1328.85          0.65              0.06
1n6dD    1n6fF        1398.41          0.13              0.05

Running time comparison of BK, Östergård and ACF on secondary structure
alignment instances for long protein chains (containing from 80 to 90 SSEs). BK
is notably slower than Östergård's algorithm, which is slightly slower than ACF.
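
The speedup factors quoted in the text are consistent with ratios of total running times over the seven instances of Table 1. The short computation below reproduces them under that assumption (the report does not state explicitly how the factors were derived); the variable and function names are chosen here for illustration.

```python
# Running times from Table 1, in seconds, one entry per instance.
bk = [1591.89, 1546.78, 1584.25, 1373.35, 1390.27, 1328.85, 1398.41]
ostergard = [1.42, 0.01, 0.14, 0.06, 0.11, 0.65, 0.13]
acf = [0.09, 0.01, 0.02, 0.01, 0.03, 0.06, 0.05]

def speedup(slow, fast):
    """Ratio of the total running times of two solvers."""
    return sum(slow) / sum(fast)

print(round(speedup(bk, ostergard)))       # 4053
print(round(speedup(ostergard, acf), 1))   # 9.3
print(round(speedup(bk, acf)))             # 37829
```

The last ratio matches the claim that ACF is more than 37000 times faster than the original VAST clique solver.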



4.2 Residues alignment 

In this section we compare ACF to Östergård in the context of residue
alignments in DAST. Computations were done on a PC with an Intel Core2 processor
at 3 GHz, and for both algorithms the computation time was bounded to 5 hours
per instance. Secondary structure assignments were done by KAKSI [12], and the
threshold distance $\tau$ was set to 3 Å. The protein structures come from the
well-known Skolnick set, described in [11]. It contains 40 protein chains having
from 90 to 256 residues, classified in SCOP [2] (v1.73) into five families.
Amongst the 780 corresponding alignment instances, 164 align protein chains from
the same family and will be called "similar". The 616 other instances align
protein chains from different families and thus will be called "dissimilar".
Characteristics of the corresponding alignment graphs are presented in Table 2.



Table 2: DAST alignment graphs characteristics

                             array size    |V|      |E|         density    |MCC|
similar instances      min   97x97         4018     106373      8.32%      45
                       max   256x255       25706    31726150    15.44%     233
dissimilar instances   min   97x104        1581     77164       5.76%      12
                       max   256x191       21244    16839653    14.13%     48

All alignment graphs from DAST have a small edge density (less than 16%).
Similar instances are characterized by bigger maximum cliques than the
dissimilar instances.

Table 3 compares the number of instances solved by each algorithm on the
Skolnick set. ACF solved 155 of the 164 similar instances, while Östergård
solved 128. ACF was able to solve all 616 dissimilar instances, while Östergård
solved only 545. Thus, on this popular benchmark set, ACF clearly outperformed
Östergård in terms of number of solved instances.



Table 3: Number of solved instances comparison

                               Östergård    ACF
Similar instances (164)           128       155
Dissimilar instances (616)        545       616
Total (780)                       673       771

Number of solved instances on the Skolnick set: ACF solves 21% more similar
instances and 13% more dissimilar instances than Östergård.

Figure 3 compares the running time of ACF to that of Östergård on the set of
673 instances solved by both algorithms (all instances solved by Östergård were
also solved by ACF). For all instances except one, ACF is significantly faster
than Östergård. More precisely, ACF needed 12 h 29 min 56 sec to solve all
these 673 instances, while Östergård needed 260 h 10 min 10 sec. Thus, on the
Skolnick set, ACF is about 20 times faster on average than Östergård (up to 4029
times for some instances).
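
The factor of about 20 follows directly from the two total running times. The conversion below is a simple check; the helper name `to_seconds` is an assumption made for this example.

```python
def to_seconds(hours, minutes, seconds):
    """Convert a duration given as h, min, sec into seconds."""
    return 3600 * hours + 60 * minutes + seconds

acf_total = to_seconds(12, 29, 56)          # 44996 seconds
ostergard_total = to_seconds(260, 10, 10)   # 936610 seconds
ratio = ostergard_total / acf_total         # between 20.8 and 20.9
```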
Figure 3: Running time comparison on the Skolnick set

[Plot: ACF running time in sec. (log scale) on the x-axis; Östergård running
time on the y-axis.]

ACF versus Östergård running time comparison on the set of the 673 Skolnick
instances solved by both algorithms. The ACF time is presented on the x-axis,
while that of Östergård is on the y-axis. For all instances except one, ACF is
faster than Östergård.

5 Conclusion and future work 

In this paper we introduce a novel protein structure comparison approach, DAST
(for Distance-based Alignment Search Tool). For any fixed threshold $\tau$, it
finds the longest alignment in which each couple of pairs of matched residues
shares the same distance relation ($\pm \tau$), and thus the RMSD of the
alignment is $\leq \tau$. This property is not guaranteed by the CMO approach,
which initially inspired DAST. From a computational standpoint, DAST requires
solving the maximum clique problem in a specific k-partite graph. By exploiting
the peculiar structure of this graph, we design a new maximum clique solver
which significantly outperforms one of the best general maximum clique solvers.
Our solver was successfully integrated into two protein structure comparison
software packages and will be freely available soon. We are currently studying
the quality of DAST alignments from a practical viewpoint and comparing the
obtained results with other structure comparison methods.